Annotated suffix trees for text modelling and classification
Abstract
Suffix trees are compact and versatile data structures in which paths from the root to nodes represent substrings of the encoded text. Annotating such a tree with substring frequencies yields a compact model of text that captures its sequential nature. This thesis investigates the use of such a model in the representation and classification of text. The basic approach is to use an Annotated Suffix Tree (AST) to represent a pre-specified collection of texts (a “class”). A document, represented as a string or as another (“auxiliary”) suffix tree, is matched against the AST. The matching serves two purposes: first, to score the match between the document and the AST and, second, to identify a number of substrings (“features”) that contribute most to the matching score. Based on this, methods are proposed for two interrelated problems: (i) classification of text against several, possibly overlapping, classes, and (ii) highlighting the features in a text that are most relevant to a particular class (a problem that, to our knowledge, has not previously been addressed computationally). The developed methods are applied to well-established text analysis problems such as e-mail spam filtering and document classification, with three aims in mind: (i) to adjust the parameters of the scoring function and assess their effect on performance, (ii) to test the method on benchmark and newly developed test sets, and (iii) to generate human-readable evaluations of classification features within query documents. Experiments show that the AST method is competitive with other current approaches and in some cases, such as spam filtering, achieves higher classification accuracy; the method also makes it possible to tackle problems not typically addressed by current alternative methods. The AST method is therefore a useful addition to the arsenal of available classification methods.
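The idea of matching a document against a frequency-annotated tree can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the thesis: it uses an uncompressed, depth-limited suffix trie rather than a compact suffix tree, the scoring function (a sum of conditional frequencies along each matched path, averaged over the document's suffixes) is only one plausible choice, and all names (`Node`, `build_ast`, `match_score`, `score_document`, `max_depth`) are hypothetical.

```python
class Node:
    """One trie node; `count` is the frequency of the substring spelled by its path."""

    def __init__(self):
        self.children = {}
        self.count = 0


def build_ast(texts, max_depth=8):
    """Insert every suffix of every text (truncated to max_depth) and count visits."""
    root = Node()
    for text in texts:
        for start in range(len(text)):
            root.count += 1  # root counts the number of inserted suffixes
            node = root
            for ch in text[start:start + max_depth]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


def match_score(root, suffix):
    """Follow `suffix` from the root as far as it matches.

    Returns (sum of child.count / parent.count along the path, matched length).
    This scoring rule is an illustrative assumption.
    """
    node, total, depth = root, 0.0, 0
    for ch in suffix:
        child = node.children.get(ch)
        if child is None:
            break
        total += child.count / node.count
        node, depth = child, depth + 1
    return total, depth


def score_document(root, document, max_depth=8):
    """Average the per-suffix scores and rank matched substrings by contribution."""
    features = {}
    total = 0.0
    for start in range(len(document)):
        score, depth = match_score(root, document[start:start + max_depth])
        total += score
        if depth:
            matched = document[start:start + depth]
            features[matched] = features.get(matched, 0.0) + score
    ranked = sorted(features.items(), key=lambda kv: kv[1], reverse=True)
    average = total / len(document) if document else 0.0
    return average, ranked


# Usage: build one tree per class and assign the document to the best-scoring class.
spam_ast = build_ast(["buy cheap pills now", "cheap pills online"])
ham_ast = build_ast(["meeting notes attached", "see you at the meeting"])
spam_score, spam_features = score_document(spam_ast, "cheap pills")
ham_score, _ = score_document(ham_ast, "cheap pills")
print("spam" if spam_score > ham_score else "ham", spam_features[:3])
```

Under these assumptions, the ranked substring list plays the role of the "features" described above: the substrings at the top are those that contribute most to the match with a given class.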
Similar resources
Annotated Suffix Trees for Text Clustering
In this paper, an extension of tf-idf weighting on the annotated suffix tree (AST) structure is described. The new weighting scheme can be used for computing similarity between texts, which can further serve as an input to a clustering algorithm. We present preliminary tests of using AST for computing similarity of Russian texts and show slight improvement in comparison to the baseline cosine similar...
Bidirectional Construction of Suffix Trees
String matching is critical in information retrieval since in many cases information is stored and manipulated as strings. Constructing and utilizing a suitable data structure for a text string, we can solve the string matching problem efficiently. Such a structure is called an index structure. Suffix trees are certainly the most widely-known and extensively-studied structure of this kind. In t...
Kernels and Similarity Measures for Text Classification
Measuring similarity between two strings is a fundamental step in text classification and other problems of information retrieval. Recently, kernel-based methods have been proposed for this task; since kernels are inner products in a feature space, they naturally induce similarity measures. Information theoretic (dis)similarities have also been the subject of recent research. This paper describ...
Suffix Trees and their Applications in String Algorithms
The suffix tree is a compacted trie that stores all suffixes of a given text string. This data structure has been intensively employed in pattern matching on strings and trees, with a wide range of applications, such as molecular biology, data processing, text editing, term rewriting, interpreter design, information retrieval, abstract data types and many others. In this paper, we survey some a...
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...